feat: get block ids with load errors #5668

cheese-head · 2026-01-27T00:39:16Z

Overview:

Add infrastructure to track and report block IDs that fail during asynchronous KV cache load operations in the KVBM connector. This enables vLLM to identify failed blocks and take corrective action (e.g., mark for recomputation).

Details:

- Add block_ids field to WorkerTransferRequest protocol for tracking which blocks belong to each transfer operation
- Add failed_block_ids field and get_block_ids_with_load_errors() method to KvConnectorWorker
- Expose the method through Python bindings (connector_worker.py, dynamo_connector.py)
- Add failure_tx/failure_rx channel between Scheduler and WorkerSchedulerClient to propagate transfer failures
- Add drain_failures() method to collect pending failure notifications
- Track request_id → uuid → block_ids mapping in worker to convert failure notifications to block IDs
- Update handle_immediate_result() to send failure notifications when result.status.is_err()
- Clean up tracking state when requests complete
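
The request_id → uuid → block_ids bookkeeping listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `FailureTracker` type, its method names, the `u128` operation ID, and the choice to empty the set when it is read are all assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the worker-side tracking state.
#[derive(Default)]
struct FailureTracker {
    /// request_id -> operation uuid -> block IDs carried by that transfer.
    request_to_blocks: HashMap<String, HashMap<u128, Vec<usize>>>,
    /// Block IDs whose load failed and has not been reported yet.
    failed_block_ids: HashSet<usize>,
}

impl FailureTracker {
    /// Remember which blocks belong to a transfer operation.
    fn track(&mut self, request_id: &str, uuid: u128, block_ids: Vec<usize>) {
        self.request_to_blocks
            .entry(request_id.to_string())
            .or_default()
            .insert(uuid, block_ids);
    }

    /// Convert a failure notification into block IDs, if the mapping is still known.
    fn record_failure(&mut self, request_id: &str, uuid: u128) {
        let blocks = self
            .request_to_blocks
            .get(request_id)
            .and_then(|ops| ops.get(&uuid));
        if let Some(ids) = blocks {
            self.failed_block_ids.extend(ids.iter().copied());
        }
    }

    /// Drop all tracking state for a completed request.
    fn cleanup_request(&mut self, request_id: &str) {
        self.request_to_blocks.remove(request_id);
    }

    /// Return the accumulated failed block IDs and reset the set (an assumption).
    fn take_failed_block_ids(&mut self) -> HashSet<usize> {
        std::mem::take(&mut self.failed_block_ids)
    }
}
```

In this sketch a failure for an unknown request or operation is silently ignored, which is why the order of draining and cleanup matters.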

Where should the reviewer start?

- lib/llm/src/block_manager/connector/scheduler.rs - Core failure notification channel implementation
- lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs - Block ID tracking and error aggregation logic
- lib/llm/src/block_manager/connector/protocol.rs - Protocol changes (block_ids field)

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Relates to GitHub issue: feat: distributed kvbm #5243

Summary by CodeRabbit

New Features
- End-to-end error tracking for block load failures with a public way to retrieve failed block IDs for diagnostics.
- Transfer operations now include block identifiers to improve error correlation during transfers.
- Scheduler and workers gain failure signaling so load failures are propagated and recorded.
Tests
- Updated tests to accommodate the new failure-tracking fields and behaviors.

Add infrastructure to track block IDs that fail during async KV cache loading, enabling vLLM to identify and handle failed blocks gracefully.

Changes:
- protocol.rs: Add block_ids field to WorkerTransferRequest
- scheduler.rs: Update tests with new block_ids field
- slot.rs: Populate block_ids for onboard/offload operations
- worker.rs: Add failed_block_ids tracking and get_block_ids_with_load_errors method
- connector_worker.py: Expose get_block_ids_with_load_errors to Python
- dynamo_connector.py: Implement KVConnectorBase_V1 interface method

Signed-off-by: Patrick Riel <priel@nvidia.com>

…acking
- Add failure_tx/failure_rx channel between Scheduler and WorkerSchedulerClient
- Add drain_failures() method to collect pending failure notifications
- Track request_id -> uuid -> block_ids mapping in KvConnectorWorker
- Populate failed_block_ids from scheduler failure notifications
- Clean up block_ids tracking when requests complete

Signed-off-by: Patrick Riel <priel@nvidia.com>

copy-pr-bot · 2026-01-27T00:39:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Patrick Riel <priel@nvidia.com>

coderabbitai · 2026-01-27T00:49:51Z

Walkthrough

Adds end-to-end propagation and retrieval of block load failure IDs: transfer requests carry block_ids, scheduler signals failures, workers collect and expose failed IDs, and Python bindings provide get_block_ids_with_load_errors().

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Python connector API `lib/bindings/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py`, `lib/bindings/kvbm/python/kvbm/vllm_integration/connector_worker.py` | Added `get_block_ids_with_load_errors()` to `DynamoConnector` and `KvConnectorWorker`; method forwards/returns failed block IDs (returns empty set if no worker). |
| Rust worker tracking & bindings `lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs` | Added per-request `request_to_blocks` mapping and `failed_block_ids` set; introduced `process_pending_failures`, `cleanup_onboarding_request`, and `get_block_ids_with_load_errors()`; exposed to Python via `PyKvConnectorWorker`. |
| Rust transfer protocol `lib/llm/src/block_manager/connector/protocol.rs` | Added `pub block_ids: Vec<usize>` to `WorkerTransferRequest` with `#[serde(default)]` for error-tracking metadata. |
| Rust scheduler & failure signaling `lib/llm/src/block_manager/connector/scheduler.rs` | Added `failure_tx`/`failure_rx` channels, `drain_failures()` on `WorkerSchedulerClient`, and failure-send logic in immediate-transfer handling; tests updated to include `block_ids`. |
| Rust leader transfer paths `lib/bindings/kvbm/src/block_manager/vllm/connector/leader/slot.rs` | Capture and forward `block_ids` into `WorkerTransferRequest` during onboard/offload to enable downstream error correlation. |
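
To make the failure-signaling row concrete, here is a rough sketch of a failure channel with a non-blocking drain. It uses `std::sync::mpsc` as a stand-in; the channel type actually used in `scheduler.rs` is not shown in this conversation, and `TransferFailure`, `SchedulerSide`, `ClientSide`, and `failure_channel` are hypothetical names for illustration only.

```rust
use std::sync::mpsc::{self, Receiver, Sender};

/// Hypothetical notification: which operation of which request failed.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TransferFailure {
    request_id: String,
    uuid: u128,
}

/// Scheduler side: owns the sending half of the channel.
struct SchedulerSide {
    failure_tx: Sender<TransferFailure>,
}

impl SchedulerSide {
    /// Send a notification only when the transfer result is an error.
    fn handle_result(&self, request_id: &str, uuid: u128, result: Result<(), String>) {
        if result.is_err() {
            // If the receiver was dropped there is nobody to notify; ignore the send error.
            let _ = self.failure_tx.send(TransferFailure {
                request_id: request_id.to_string(),
                uuid,
            });
        }
    }
}

/// Client side: owns the receiving half.
struct ClientSide {
    failure_rx: Receiver<TransferFailure>,
}

impl ClientSide {
    /// Collect every notification that is already queued, without blocking.
    fn drain_failures(&self) -> Vec<TransferFailure> {
        self.failure_rx.try_iter().collect()
    }
}

/// Build both halves of the (hypothetical) failure channel.
fn failure_channel() -> (SchedulerSide, ClientSide) {
    let (failure_tx, failure_rx) = mpsc::channel();
    (SchedulerSide { failure_tx }, ClientSide { failure_rx })
}
```

`try_iter` stops as soon as the queue is empty, so a drain never waits for transfers that are still in flight.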

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble logs and follow tracks of bits,
I gather IDs of blocks that missed their hits.
I carry them home in a tidy heap,
Then hand them back so the makers can keep —
a garden whole where transfers fit.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main feature being added: a method to retrieve block IDs that encountered load errors during async operations. |
| Description check | ✅ Passed | The description is comprehensive and well-structured, following the template with clear overview, detailed technical changes, reviewer guidance, and related issue reference. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs`:
- Around line 293-305: When enqueuing onboarding_operations we currently stash
request->blocks in request_to_blocks but later remove that mapping in
get_finished before we call get_block_ids_with_load_errors, which can drop the
failure->block correlation; fix by ensuring failures are drained/translated into
block IDs before cleanup: call get_block_ids_with_load_errors (or a shared
helper that drains scheduler failures and maps them to block IDs using
request_to_blocks) prior to removing the request's entry from request_to_blocks
in get_finished; factor the draining/mapping logic into a reusable helper (e.g.,
drain_and_map_load_failures) and use it from both the onboarding loop code paths
and the get_finished cleanup to preserve failure→block mappings.
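
The ordering issue can be illustrated with a small sketch. It assumes a simplified worker in which failures are queued by request ID only; the `Worker` struct, its `pending_failures` queue, and `finish_request` are hypothetical, while `drain_and_map_load_failures` is the helper name suggested above.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Simplified, hypothetical worker state.
#[derive(Default)]
struct Worker {
    /// request_id -> block IDs being onboarded for that request.
    request_to_blocks: HashMap<String, Vec<usize>>,
    /// Stand-in for failures queued by the scheduler, keyed by request_id.
    pending_failures: VecDeque<String>,
    /// Block IDs whose load failed.
    failed_block_ids: HashSet<usize>,
}

impl Worker {
    /// Shared helper: translate queued failures into block IDs
    /// while the request -> blocks mapping still exists.
    fn drain_and_map_load_failures(&mut self) {
        while let Some(request_id) = self.pending_failures.pop_front() {
            if let Some(ids) = self.request_to_blocks.get(&request_id) {
                self.failed_block_ids.extend(ids.iter().copied());
            }
        }
    }

    /// Cleanup path: drain first, then drop the mapping.
    fn finish_request(&mut self, request_id: &str) {
        self.drain_and_map_load_failures();
        self.request_to_blocks.remove(request_id);
    }
}
```

If the two statements in `finish_request` were swapped, a failure queued for that request would find no mapping and its block IDs would be lost.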

🧹 Nitpick comments (1)

lib/llm/src/block_manager/connector/protocol.rs (1)
152-154: Confirm whether #[serde(default)] should mask missing block_ids.
Defaulting to empty accepts older senders but silently drops error tracking. If deployments are atomic and you want strictness, consider making the field required so deserialization fails fast when it’s absent.
💡 Possible tweak to keep the protocol explicit
-    #[serde(default)]
-    pub block_ids: Vec<usize>,
+    pub block_ids: Vec<usize>,
Based on learnings, keep the spec explicit if backward compatibility isn't required.

lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs

cheese-head · 2026-01-28T03:24:31Z

@coderabbitai review

coderabbitai · 2026-01-28T03:24:43Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Signed-off-by: Patrick Riel <priel@nvidia.com>

cheese-head · 2026-01-28T03:39:01Z

@coderabbitai review

coderabbitai · 2026-01-28T03:39:17Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs`:
- Line 127: Remove the trailing whitespace on the affected line in
lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs (the line flagged
by CI around the worker module) so the file has no extraneous spaces at
end-of-line; you can delete the trailing space directly or run a whitespace
cleanup (editor trim trailing whitespace or rustfmt/trim tool) and re-run the
pre-commit hook to verify the CI error is resolved.

🧹 Nitpick comments (2)

lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs (2)

128-129: Track the TODO with an issue (or handle in this PR).

Leaving the “move this out of bindings” TODO without tracking risks it lingering.

Do you want me to propose a follow-up issue or a concrete refactor plan?

136-137: Avoid silent truncation when casting block IDs to u32.

If block_id can exceed u32::MAX, as u32 will truncate. Consider try_from (or an assertion) to make this safe.

♻️ Safer conversion

-                        for &block_id in block_ids {
-                            self.failed_block_ids.insert(block_id as u32);
-                        }
+                        for &block_id in block_ids {
+                            match u32::try_from(block_id) {
+                                Ok(block_id_u32) => {
+                                    self.failed_block_ids.insert(block_id_u32);
+                                }
+                                Err(_) => {
+                                    tracing::warn!(
+                                        request_id = %request_id,
+                                        operation_id = %uuid,
+                                        block_id,
+                                        "block_id exceeds u32::MAX; skipping"
+                                    );
+                                }
+                            }
+                        }
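
The difference between the two conversions can be seen in isolation. The sketch below uses `u64` instead of `usize` so that it behaves the same on every platform; the function names are hypothetical.

```rust
use std::convert::TryFrom;

/// Lossy conversion: `as` keeps only the low 32 bits of the value.
fn truncating(block_id: u64) -> u32 {
    block_id as u32
}

/// Checked conversion: returns None when the value does not fit in a u32.
fn checked(block_id: u64) -> Option<u32> {
    u32::try_from(block_id).ok()
}
```

For `u64::from(u32::MAX) + 1`, `truncating` returns 0, which would silently mark block 0 as failed, while `checked` returns `None` and lets the caller log or skip the value.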

coderabbitai · 2026-01-28T03:44:37Z

lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs

+            failed_block_ids: HashSet::new(),
        })
    }
+


⚠️ Potential issue | 🟡 Minor

Remove trailing whitespace to fix the failing pre-commit hook.

Line 127 has trailing whitespace, which is already failing CI.

cheese-head added 2 commits January 27, 2026 00:36

pull-request-size bot added the size/L label Jan 27, 2026

github-actions bot added the feat label Jan 27, 2026

cheese-head marked this pull request as ready for review January 27, 2026 00:39

cheese-head requested review from a team as code owners January 27, 2026 00:39

cargo fmt

eac002e

Signed-off-by: Patrick Riel <priel@nvidia.com>

coderabbitai bot reviewed Jan 27, 2026

drain failures before cleanup and reuse a helper (#5712)

aa5cf2a

Signed-off-by: Patrick Riel <priel@nvidia.com>

coderabbitai bot reviewed Jan 28, 2026

2 participants
